LOOKING INTO ACTORS, OBJECTS AND THEIR INTERACTIONS FOR VIDEO UNDERSTANDING
Automatic video understanding is critical for enabling new applications in video surveillance, augmented reality, and beyond. Powered by deep networks that learn holistic representations of video clips and by large-scale annotated datasets, modern systems can accurately recognize hundreds of human activity classes. However, their performance degrades significantly as the number of actors in the scene or the complexity of the activities increases. As a result, most research thus far has focused on videos that are short and/or contain a few activities performed only by adults. Furthermore, most current systems require expensive spatio-temporal annotations for training. These limitations prevent the deployment of such systems in real-life applications, such as detecting the activities of people and vehicles in extended surveillance videos.
To address these limitations, this thesis focuses on developing data-driven, compositional, region-based video understanding models, motivated by the observation that actors, objects, and their spatio-temporal interactions are the building blocks of activities and the main content of video descriptions provided by humans. This thesis makes three main contributions. First, we propose a novel Graph Neural Network for representation learning on heterogeneous graphs that encode spatio-temporal interactions between actor and object regions in videos. This model learns context-aware representations for detected actors and objects, which we leverage for detecting complex activities. Second, we propose an attention-based deep conditional generative model of sentences whose latent variables correspond to alignments between words in textual descriptions of videos and object regions. Building upon the framework of Conditional Variational Autoencoders, we train this model using only textual descriptions, without bounding-box annotations, and leverage its latent variables to localize the actors and objects that are mentioned in generated or ground-truth descriptions of videos. Finally, we propose an actor-centric framework for real-time activity detection in videos that are extended both in space and time. Our framework leverages object detection and tracking to generate actor-centric tubelets, each capturing all relevant spatio-temporal context for a single actor, and detects activities per tubelet based on contextual region embeddings. The proposed models demonstrably improve the ability to temporally detect activities, as well as to ground words in visual inputs.
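The abstract above does not include implementation details. The sketch below is a minimal, hypothetical illustration of message passing over a heterogeneous actor-object interaction graph of the kind described, with region embeddings as node features and typed edges for actor-object and actor-actor interactions; all class names, dimensions, and edge types are assumptions for exposition, not the thesis code.

```python
# Hypothetical sketch (not the thesis implementation): one round of message passing
# over a heterogeneous graph whose nodes are detected actor and object regions.
# Node features would come from a region feature extractor (e.g. RoI pooling);
# each edge type gets its own message function, as is common for heterogeneous GNNs.
import torch
import torch.nn as nn

class HeteroInteractionLayer(nn.Module):
    def __init__(self, dim, edge_types=("actor_object", "object_actor", "actor_actor")):
        super().__init__()
        self.message_fns = nn.ModuleDict({t: nn.Linear(2 * dim, dim) for t in edge_types})
        self.update_fn = nn.GRUCell(dim, dim)

    def forward(self, node_feats, edges):
        # node_feats: (N, dim) region embeddings; edges: list of (src, dst, edge_type).
        messages = torch.zeros_like(node_feats)
        counts = torch.zeros(node_feats.size(0), 1)
        for src, dst, etype in edges:
            pair = torch.cat([node_feats[src], node_feats[dst]], dim=-1)
            messages[dst] = messages[dst] + torch.relu(self.message_fns[etype](pair))
            counts[dst] += 1
        messages = messages / counts.clamp(min=1)       # mean-aggregate incoming messages
        return self.update_fn(messages, node_feats)     # context-aware node update

# Toy usage: 2 actors + 3 objects with 256-d region features.
feats = torch.randn(5, 256)
edges = [(0, 2, "actor_object"), (2, 0, "object_actor"), (0, 1, "actor_actor")]
layer = HeteroInteractionLayer(256)
context_aware = layer(feats, edges)   # (5, 256) embeddings usable for activity detection
```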
Learning to Ground Instructional Articles in Videos through Narrations
In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, and ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. To validate our model we introduce a new evaluation benchmark, HT-Step, obtained by manually annotating a 124-hour subset of HowTo100M (a test server is accessible at https://eval.ai/web/challenges/challenge-page/2082) with steps sourced from wikiHow articles. Experiments on this benchmark, as well as zero-shot evaluations on CrossTask, demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for matching narrations to video outperforms the state of the art by a large margin on the HTM-Align narration-video alignment benchmark.
Comment: 17 pages, 4 figures, and 10 tables.
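As a rough illustration of the two alignment pathways described above, the sketch below composes steps-to-narrations with narrations-to-frames similarities and fuses the result with the direct steps-to-frames similarity. The embedding shapes, the softmax routing, and the fusion weight are assumptions for exposition, not the paper's implementation.

```python
# Minimal, assumed sketch of fusing direct and indirect step-to-video alignment.
# Inputs are L2-normalized embeddings for S wikiHow steps, N narrations, and T frames.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_alignments(step_emb, narr_emb, frame_emb, alpha=0.5):
    # Direct pathway: steps vs. frames, shape (S, T).
    direct = step_emb @ frame_emb.T
    # Indirect pathway: route each step through the narrations it matches,
    # then through the narrations' correspondence to frames, shape (S, T).
    steps_to_narr = softmax(step_emb @ narr_emb.T, axis=1)   # (S, N)
    narr_to_frames = narr_emb @ frame_emb.T                  # (N, T)
    indirect = steps_to_narr @ narr_to_frames
    # Fused step-to-frame alignment; argmax over time grounds each step.
    return alpha * direct + (1 - alpha) * indirect

# Toy usage with random unit-norm embeddings (S=4, N=6, T=50, d=128).
rng = np.random.default_rng(0)
def unit(a): return a / np.linalg.norm(a, axis=-1, keepdims=True)
A = fuse_alignments(unit(rng.normal(size=(4, 128))),
                    unit(rng.normal(size=(6, 128))),
                    unit(rng.normal(size=(50, 128))))
print(A.shape)  # (4, 50): per-step scores over time
```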
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or a video-query pair where the query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i.e., the actors and objects in it, and their actions and interactions. So far these tasks have been tackled in isolation with individual, highly specialized architectures, which do not exploit the interplay between tasks. In contrast, in this paper we present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark, which entail queries of three different forms: given an egocentric video and a visual, textual, or activity query, the goal is to determine when and where the answer can be seen within the video. Our model design is inspired by recent query-based approaches to spatio-temporal grounding, and contains modality-specific query encoders and task-specific sliding-window inference that allow multi-task training with diverse input modalities and different structured outputs. We exhaustively analyze relationships among the tasks and illustrate that cross-task learning leads to improved performance on each individual task, as well as the ability to generalize to unseen tasks, such as zero-shot spatial localization of language queries.
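To make the sliding-window idea concrete, here is a small, assumed sketch of scoring overlapping temporal windows of a long video against a single encoded query; in practice the query embedding would come from a modality-specific encoder (visual crop, sentence, or activity name), and the output structure would depend on the task. The function name, shapes, and scoring rule are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of sliding-window temporal grounding against an encoded query.
import numpy as np

def sliding_window_grounding(frame_feats, query_feat, window=64, stride=32, top_k=3):
    """frame_feats: (T, d) features of a long video; query_feat: (d,) encoded query."""
    T = frame_feats.shape[0]
    scored = []
    for start in range(0, max(T - window, 0) + 1, stride):
        clip = frame_feats[start:start + window]     # (<=window, d)
        scores = clip @ query_feat                    # per-frame relevance to the query
        scored.append((float(scores.max()), start, start + len(clip)))
    scored.sort(reverse=True)
    return scored[:top_k]                             # [(score, t_start, t_end), ...]

# Toy usage: a 500-frame video with 256-d features and a random query embedding.
rng = np.random.default_rng(1)
segments = sliding_window_grounding(rng.normal(size=(500, 256)), rng.normal(size=256))
print(segments)  # top-scoring temporal segments for the query
```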
GEARing smart environments for pediatric motor rehabilitation.
BACKGROUND: There is a lack of early (infant) mobility rehabilitation approaches that incorporate natural and complex environments and have the potential to concurrently advance motor, cognitive, and social development. The Grounded Early Adaptive Rehabilitation (GEAR) system is a pediatric learning environment designed to provide motor interventions that are grounded in social theory and can be applied in early life. Within a perceptively complex and behaviorally natural setting, GEAR utilizes novel body-weight support technology and socially assistive robots to both ease and encourage mobility in young children through play-based child-robot interaction. This methodology article reports on the development and integration of the different system components and presents preliminary evidence on the feasibility of the system.
METHODS: GEAR consists of physical and cyber components. The physical component includes the playground equipment that enriches the environment, an open-area body-weight support (BWS) device that assists children by partially counteracting gravity, two mobile robots that engage children in motor activity through social interaction, and a synchronized camera network that monitors the sessions. The cyber component consists of the interface for collecting human movement and video data, the algorithms for identifying the children's actions from the video stream, and the behavioral models for child-robot interaction that suggest the most appropriate robot action in support of the child's motor training goals. The feasibility of both components was assessed via preliminary testing. Three very young children (with and without Down syndrome) used the system in eight sessions within a 4-week period.
RESULTS: All subjects completed the eight-session protocol, participated in all tasks involving the selected objects of the enriched environment, used the BWS device, and interacted with the robots in all eight sessions. Action classification algorithms to identify early child behaviors in a complex naturalistic setting were tested and validated using the video data. Decision-making algorithms specific to the types of interaction observed in the GEAR system were developed for use in robot automation.
CONCLUSIONS: Preliminary results from this study support the feasibility of both the physical and cyber components of the GEAR system and demonstrate its potential for use in future studies to assess its effects on the co-development of the motor, cognitive, and social systems of very young children with mobility challenges.